COMPARISON OF THE EFFECTS OF LEXICAL AND ONTOLOGICAL INFORMATION ON TEXT CATEGORIZATION by CESAR KOIRALA

نویسندگان

  • Khaled Rasheed
  • Cesar Koirala
  • Walter D. Potter
  • Nash Unsworth
  • Maureen Grasso
  • Michael A. Covington
چکیده

ON TEXT CATEGORIZATION by CESAR KOIRALA (Under the Direction of Khaled Rasheed) ABSTRACT This thesis compares the effectiveness of using lexical and ontological information for text categorization. Lexical information has been induced using stemmed features. Ontological information, on the other hand, has been induced in the form of WordNet hypernyms. Text representations based on stemming and WordNet hypernyms were evaluated using four different machine learning algorithms on two datasets. The research reports average F1 measures as the results. The results show that, for the larger dataset, stemming-based text representation gives better performance than hypernym-based text representation even though the later uses a novel hypernym formation approach. However, for the smaller data set with relatively lower feature overlap, hypernym-based text representations produce results that are comparable to the stemming-based text representation. The results also indicate that combining stemming-based representation and hypernym-based representation produces an improvement in the performance for the smaller dataset.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

The Title of a Literary Text as a Discursive Phenomenon

Modern text linguistics pays serious attention to the significant structural elements of the text, which carry special knowledge. Such structural elements include the title. In this article, the title is considered as a linguistic and cognitive characteristic and a spatially fixed structural element of the text – «frame», which is located around/before/behind the text, focusing on the importanc...

متن کامل

Automatic Acquisition of Hyponyms and Meronyms from Question Corpora

We explore how lexical and ontological relations can be acquired automatically from natural language questions. The focus in this paper is on identifying hyponym and meronym relations by using simple pattern matching. It is shown that natural language questions can provide a significant source for ontological information.

متن کامل

Interfacing Lexical and Ontological Information in a Multilingual Soccer FrameNet

This paper presents ongoing work on a multilingual (English, French, German) lexical resource of soccer language. The first part describes how lexicographic descriptions based on frame-semantic principles are derived from a partially aligned multilingual corpus of soccer match reports. The remainder of the paper then discusses how different types of ontological knowledge are linked to this reso...

متن کامل

Automating Ontological Annotation with WordNet

Semantic Web applications require robust and accurate annotation tools that are capable of automating the assignment of ontological classes to words in naturally occurring text (ontological annotation). Most current ontologies do not include rich lexical databases and are therefore not easily integrated with word sense disambiguation algorithms that are needed to automate ontological annotation...

متن کامل

Ontological Annotation with WordNet

Semantic Web applications require robust and accurate annotation tools that are capable of automating the assignment of ontological classes to words in naturally occurring text (ontological annotation). Most current ontologies do not include rich lexical databases and are therefore not easily integrated with word sense disambiguation algorithms that are needed to automate ontological annotation...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2008